perf: replace Gauss-Jordan with Cholesky precision sampler and pre-compute loop invariants #10
Merged
…tation Tests verify sampler output correctness, determinism, spike-and-slab behavior, many-covariate numerical stability (k=20), speed benchmarks, and seasonal regression - all passing on current code as baseline before refactoring.
…ling
Replace private cholesky() with public cholesky_lower(), add forward_solve,
backward_solve_lt, chol_solve_lower, and sample_from_precision. Remove
sample_mvnormal (no remaining callers after sampler.rs migration).
sample_from_precision samples beta ~ N(A^{-1}b, sigma2 * A^{-1}) using
a single Cholesky factorization of the precision matrix A, matching the
approach used by R's bsts package. This replaces the previous approach
of explicit Gauss-Jordan inversion + separate mvnormal sampling.
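The sampling step described in this commit can be sketched as follows. This is a minimal illustration, not the code from distributions.rs: the function names follow the commit message, but the signatures, the `&[Vec<f64>]` matrix layout, and passing the standard-normal draws `z` in as an argument (instead of an RNG) are assumptions made to keep the example dependency-free. With A = L Lᵀ, solving L y = b and then Lᵀ β = y + σ z gives a draw with mean A⁻¹b and covariance σ²A⁻¹.

```rust
/// Lower-triangular Cholesky factor L with A = L * L^T.
/// Returns None if A is not positive definite.
fn cholesky_lower(a: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
    let k = a.len();
    let mut l = vec![vec![0.0; k]; k];
    for i in 0..k {
        for j in 0..=i {
            let mut s = a[i][j];
            for p in 0..j {
                s -= l[i][p] * l[j][p];
            }
            if i == j {
                if s <= 0.0 {
                    return None;
                }
                l[i][j] = s.sqrt();
            } else {
                l[i][j] = s / l[j][j];
            }
        }
    }
    Some(l)
}

/// Solve L y = b for lower-triangular L.
fn forward_solve(l: &[Vec<f64>], b: &[f64]) -> Vec<f64> {
    let k = b.len();
    let mut y = vec![0.0; k];
    for i in 0..k {
        let mut s = b[i];
        for j in 0..i {
            s -= l[i][j] * y[j];
        }
        y[i] = s / l[i][i];
    }
    y
}

/// Solve L^T x = y, reading L^T[i][j] as l[j][i] (no transpose is materialized).
fn backward_solve_lt(l: &[Vec<f64>], y: &[f64]) -> Vec<f64> {
    let k = y.len();
    let mut x = vec![0.0; k];
    for i in (0..k).rev() {
        let mut s = y[i];
        for j in (i + 1)..k {
            s -= l[j][i] * x[j];
        }
        x[i] = s / l[i][i];
    }
    x
}

/// Draw beta ~ N(A^{-1} b, sigma2 * A^{-1}) from one factorization of A.
/// `z` holds k independent standard-normal draws supplied by the caller
/// (an assumption of this sketch; the real function presumably takes an RNG).
fn sample_from_precision(
    a: &[Vec<f64>],
    b: &[f64],
    sigma2: f64,
    z: &[f64],
) -> Option<Vec<f64>> {
    let l = cholesky_lower(a)?;
    let y = forward_solve(&l, b);
    let sigma = sigma2.sqrt();
    // L^T beta = y + sigma * z  =>  beta = A^{-1} b + sigma * L^{-T} z
    let rhs: Vec<f64> = y.iter().zip(z).map(|(yi, zi)| yi + sigma * zi).collect();
    Some(backward_solve_lt(&l, &rhs))
}
```

Passing `z = 0` returns the posterior mean A⁻¹b, which is a convenient way to check the solves independently of the random draw.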
- Pre-compute cross_product_matrix(X, T) once before the loop instead of every iteration (eliminates O(k^2 T) per iteration)
- Pre-compute spike-and-slab x_mean and n_j per covariate (O(1) lookup instead of O(T) per covariate per iteration)
- Replace invert_matrix + sample_mvnormal with sample_from_precision (single Cholesky instead of Gauss-Jordan + second Cholesky)
- Delete invert_matrix, scale_matrix (no remaining callers)
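The two pre-computations listed in this commit might look roughly like the sketch below. The row-major layout (T rows, k columns), the single-argument signature of `cross_product_matrix`, and the helper name `slab_stats` as a free function are assumptions for illustration; the actual code in sampler.rs may be organized differently.

```rust
/// X^T X for a row-major design matrix `x` with T rows and k columns.
/// Costs O(k^2 T), so it is worth computing once when X never changes.
fn cross_product_matrix(x: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let k = x.first().map_or(0, |row| row.len());
    let mut xtx = vec![vec![0.0; k]; k];
    for row in x {
        // Accumulate only the lower triangle; X^T X is symmetric.
        for i in 0..k {
            for j in 0..=i {
                xtx[i][j] += row[i] * row[j];
            }
        }
    }
    // Mirror the lower triangle into the upper triangle.
    for i in 0..k {
        for j in 0..i {
            xtx[j][i] = xtx[i][j];
        }
    }
    xtx
}

/// Per-covariate (x_mean, n_j), where n_j = sum over t of (x_tj - x_mean)^2.
/// Both depend only on X, so they are loop invariants of the Gibbs sampler.
fn slab_stats(x: &[Vec<f64>]) -> Vec<(f64, f64)> {
    let t = x.len() as f64;
    let k = x.first().map_or(0, |row| row.len());
    (0..k)
        .map(|j| {
            let mean = x.iter().map(|row| row[j]).sum::<f64>() / t;
            let n_j = x.iter().map(|row| (row[j] - mean).powi(2)).sum::<f64>();
            (mean, n_j)
        })
        .collect()
}
```

Calling both functions once before the loop and passing the results by reference into each iteration turns the per-iteration cost into a lookup.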
Add [profile.release] with thin LTO, single codegen unit, and panic=abort to maximize cross-function inlining and reduce binary size.
Remove unused pytest import and fix line length violations.
These are implementation details of sample_from_precision, not part of the public API. No external callers exist outside distributions.rs. Reduces public API surface per refactor-cleaner review.
- Remove panic="abort" from [profile.release] to preserve PyO3's panic catch mechanism (prevents Python process crash on Rust panic)
- Eliminate xtx pre.clone() by building posterior_precision directly from xtx_ref + prior_precision (avoids k*k Vec clone per iteration)
- Change xtx_precomputed type from Option<&Vec<Vec<f64>>> to Option<&[Vec<f64>]> (idiomatic Rust, use .as_deref() at call sites)
- Skip xtx_static computation when spike-and-slab is active (coordinate-wise sampling does not use XtX)
- Remove extra blank lines left from scale_matrix deletion
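A rough sketch of the `Option<&[Vec<f64>]>` parameter and the clone-free construction of the posterior precision is shown below. The function name `posterior_precision`, its signature, and the fallback helper are hypothetical and only illustrate the pattern; the names `xtx_precomputed`, `xtx_ref`, and `prior_precision` are taken from the commit message.

```rust
/// X^T X for a row-major design matrix (hypothetical fallback helper).
fn cross_product_matrix(x: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let k = x.first().map_or(0, |row| row.len());
    let mut xtx = vec![vec![0.0; k]; k];
    for row in x {
        for i in 0..k {
            for j in 0..k {
                xtx[i][j] += row[i] * row[j];
            }
        }
    }
    xtx
}

/// Build X^T X + prior_precision without cloning the cached X^T X.
/// `xtx_precomputed` is None when the caller has no cached matrix,
/// in which case X^T X is recomputed as a fallback.
fn posterior_precision(
    x: &[Vec<f64>],
    xtx_precomputed: Option<&[Vec<f64>]>,
    prior_precision: &[Vec<f64>],
) -> Vec<Vec<f64>> {
    // Deferred initialization keeps the fallback alive for the borrow below.
    let fallback;
    let xtx_ref: &[Vec<f64>] = match xtx_precomputed {
        Some(m) => m,
        None => {
            fallback = cross_product_matrix(x);
            fallback.as_slice()
        }
    };
    // Sum element-wise straight into the output; no intermediate copy of X^T X.
    xtx_ref
        .iter()
        .zip(prior_precision)
        .map(|(xr, pr)| xr.iter().zip(pr).map(|(a, b)| a + b).collect())
        .collect()
}
```

At a call site that owns an `Option<Vec<Vec<f64>>>`, `cached.as_deref()` produces the `Option<&[Vec<f64>]>` expected here, so the cache can stay `None` (for example when spike-and-slab is active) without changing the callee.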
- Remove no-op loop in spike-and-slab n_j < 1e-12 guard: beta[j] is 0.0 so x_col[t] * 0.0 = 0.0 never changes the residual
- Add #[allow(clippy::too_many_arguments)] to sample_state_path to eliminate the last remaining clippy warning
YuminosukeSato added a commit that referenced this pull request (Mar 23, 2026): perf: replace Gauss-Jordan with Cholesky precision sampler and pre-compute loop invariants
Summary
Replace the beta sampling algorithm from Gauss-Jordan matrix inversion + separate multivariate normal sampling (2× O(k³)) with a single Cholesky factorization of the precision matrix (O(k³/6)), matching the approach used by R's bsts package. Additionally, pre-compute loop-invariant values (X^TX matrices and spike-and-slab statistics) before the Gibbs loop.
Motivation
- cross_product_matrix(X, T) was called every Gibbs iteration despite X being constant. For k=20, T=400, niter=1000, this wastes 160M floating-point operations.
- invert_matrix() (Gauss-Jordan, O(k³)) followed by sample_mvnormal() (internal Cholesky, O(k³)) = 2× O(k³) per iteration. The new sample_from_precision() does a single Cholesky of the precision matrix.
- x_mean and n_j = Σ(x - x̄)² are constants recomputed O(kT) per iteration.
Changes
src/distributions.rs
- Replace cholesky() → cholesky_lower() with doc comments
- Add forward_solve(), backward_solve_lt(), chol_solve_lower() for triangular system solving
- Add sample_from_precision(): samples β ~ N(A⁻¹b, σ²A⁻¹) via Cholesky of precision A
- Remove sample_mvnormal() (no remaining callers)
src/sampler.rs
- Pre-compute xtx_static, xtx_seasonal before the Gibbs loop (O(k²T) × 1 instead of × niter)
- Pre-compute slab_stats (x_mean, n_j per covariate) before the Gibbs loop
- Replace invert_matrix + sample_mvnormal with sample_from_precision in sample_beta_with_normal_prior
- Add xtx_precomputed: Option<&[Vec<f64>]> parameter with fallback for backward compatibility
- Add precomputed_stats: &[(f64, f64)] parameter to sample_spike_and_slab
- Remove no-op loop (residual -= x * 0.0)
- Delete invert_matrix(), scale_matrix(), test_invert_identity (no remaining callers)
Cargo.toml
- Add [profile.release] with lto = "thin" and codegen-units = 1
tests/test_rust_speedup.py (new)
Benchmark Results
Absolute times are small because the existing Rust code already uses SIMD (target-cpu=native). The algorithmic change (O(k²T) → O(1) XtX, 2×O(k³) → O(k³/6) Cholesky) shows a measurable improvement at larger k. The primary benefit is numerical compatibility with R.
Test Plan
- cargo test: 36 Rust unit tests pass (including 8 new Cholesky tests)
- .venv/bin/pytest tests/ -v: 224 Python tests pass (including 12 new speedup tests)
- cargo clippy: 0 warnings